# Computations
import pandas as pd
import numpy as np
import calendar
# sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Pytorch
import torch
from torch.autograd import Variable
import torch.nn as nn
import torchvision.transforms as transforms
# Visualisation libraries
## Progress Bar
import progressbar
## Text
from colorama import Fore, Back, Style
from IPython.display import Image, display, Markdown, Latex, clear_output
## plotly
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
from plotly.subplots import make_subplots
import plotly.express as px
## seaborn
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper", rc={"font.size":12,"axes.titlesize":14,"axes.labelsize":12})
## matplotlib
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from matplotlib.patches import Ellipse, Polygon
import matplotlib.gridspec as gridspec
import matplotlib.colors
from pylab import rcParams
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (17, 6)
mpl.rcParams['axes.labelsize'] = 14
mpl.rcParams['xtick.labelsize'] = 12
mpl.rcParams['ytick.labelsize'] = 12
mpl.rcParams['text.color'] = 'k'
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
In this article, we work with the Bank Marketing dataset from the UC Irvine Machine Learning Repository. The data concerns direct marketing campaigns (phone calls) of a Portuguese banking institution; often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The dataset is enriched with five social and economic features/attributes (nation-wide indicators for a country of roughly 10M people), published by the Banco de Portugal and publicly available at bportugal.pt/estatisticasweb. It is almost identical to the one used in [Moro et al., 2014], omitting a few attributes due to privacy concerns.
The binary classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Data = pd.read_csv('Data/Bank_mod.csv')
display(Data.head().round(2))
| | Age | Job | Marital | Education | Default | Housing | Loan | Contact | Month | Day Of Week | ... | Campaign | Pdays | Previous | Poutcome | Employment Variation Rate | Consumer Price Index | Consumer Confidence Index | Euribor three Month Rate | Number of Employees | Term Deposit Subscription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Housemaid | Married | Basic.4Y | No | No | No | Telephone | May | Monday | ... | 1 | 999 | 0 | Nonexistent | 1.1 | 93.99 | -36.4 | 4.86 | 5191.0 | No |
| 1 | 57 | Services | Married | High.School | Unknown | No | No | Telephone | May | Monday | ... | 1 | 999 | 0 | Nonexistent | 1.1 | 93.99 | -36.4 | 4.86 | 5191.0 | No |
| 2 | 37 | Services | Married | High.School | No | Yes | No | Telephone | May | Monday | ... | 1 | 999 | 0 | Nonexistent | 1.1 | 93.99 | -36.4 | 4.86 | 5191.0 | No |
| 3 | 40 | Admin. | Married | Basic.6Y | No | No | No | Telephone | May | Monday | ... | 1 | 999 | 0 | Nonexistent | 1.1 | 93.99 | -36.4 | 4.86 | 5191.0 | No |
| 4 | 56 | Services | Married | High.School | No | No | Yes | Telephone | May | Monday | ... | 1 | 999 | 0 | Nonexistent | 1.1 | 93.99 | -36.4 | 4.86 | 5191.0 | No |
5 rows × 21 columns
| Number of Instances | Number of Attributes |
|---|---|
| 41188 | 21 |
| Feature | Description |
|---|---|
| Age | age of the client (numeric) |
| Job | Type of Job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown") |
| Marital | marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed) |
| Education | education level (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown") |
| Default | has credit in default? (categorical: "no","yes","unknown") |
| Housing | has housing loan? (categorical: "no","yes","unknown") |
| Loan | has personal loan? (categorical: "no","yes","unknown") |
| Feature | Description |
|---|---|
| Contact | contact communication type (categorical: "cellular","telephone") |
| Month | last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec") |
| Day of week | last contact day of the week (categorical: "mon","tue","wed","thu","fri") |
| Duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. |
| Feature | Description |
|---|---|
| Campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| Pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| Previous | number of contacts performed before this campaign and for this client (numeric) |
| Poutcome | outcome of the previous marketing campaign (categorical: "failure","nonexistent","success") |
| Feature | Description |
|---|---|
| Employment Variation Rate | employment variation rate - quarterly indicator (numeric) |
| Consumer Price Index | consumer price index - monthly indicator (numeric) |
| Consumer Confidence Index | consumer confidence index - monthly indicator (numeric) |
| Euribor three Month Rate | euribor* 3 month rate - daily indicator (numeric) |
| Number of Employees | number of employees - quarterly indicator (numeric) |
* Euribor: the basic rate of interest used in lending between banks on the European Union interbank market, also used as a reference for setting the interest rate on other loans.
| Feature | Description |
|---|---|
| Term Deposit Subscription | has the client subscribed to a term deposit? (binary: "yes","no") |
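As the note on Duration above points out, that feature is only known after the call ends, so it should be discarded for a realistic predictive model. A minimal sketch of doing so, using a toy frame with this dataset's column names:

```python
import pandas as pd

# Toy frame standing in for the bank data; only a few columns are shown.
df = pd.DataFrame({'Age': [56, 37],
                   'Duration': [261, 149],
                   'Term Deposit Subscription': ['No', 'No']})

# Duration is unknown before the call is made, so exclude it
# from the inputs of a realistic model.
df_realistic = df.drop(columns=['Duration'])
print(df_realistic.columns.tolist())  # ['Age', 'Term Deposit Subscription']
```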
Dataset_Subcategories = {}
Dataset_Subcategories['Bank Client Data'] = Data.iloc[:,:7].columns.tolist()
Dataset_Subcategories['Related with the Last Contact of the Current Campaign'] = Data.iloc[:,7:11].columns.tolist()
Dataset_Subcategories['Other Attributes'] = Data.iloc[:,11:15].columns.tolist()
Dataset_Subcategories['Social and Economic Context Attributes'] = Data.iloc[:,15:-1].columns.tolist()
Dataset_Subcategories['Output variable (Desired Target)'] = Data.iloc[:,-1:].columns.tolist()
def List_Print(Text, List):
print(Back.BLACK + Fore.CYAN + Style.NORMAL + '%s:' % Text + Style.RESET_ALL + ' %s' % ', '.join(List))
def Data_Plot(Inp):
data_info = Inp.dtypes.astype(str).to_frame(name='Data Type')
Temp = Inp.isnull().sum().to_frame(name = 'Number of NaN Values')
data_info = data_info.join(Temp, how='outer')
data_info ['Size'] = Inp.shape[0]
data_info['Percentage'] = 100 - np.round(100*(data_info['Number of NaN Values']/Inp.shape[0]),2)
data_info.index.name = 'Features'
data_info = data_info.reset_index(drop = False)
#
fig = px.bar(data_info, x= 'Features', y= 'Percentage', color = 'Data Type', text = 'Data Type',
color_discrete_sequence = ['PaleGreen', 'LightCyan', 'PeachPuff', 'Pink', 'Plum'],
hover_data = data_info.columns)
fig.update_layout(plot_bgcolor= 'white', legend=dict(x=1.01, y=.5, traceorder="normal",
bordercolor="DarkGray", borderwidth=1))
fig.update_traces(texttemplate= 6*' ' + '%{label}', textposition='inside')
fig.update_traces(marker_line_color= 'Black', marker_line_width=1., opacity=1)
fig.show()
def dtypes_group(Inp):
Temp = Inp.dtypes.to_frame(name='Data Type').sort_values(by=['Data Type'])
Out = pd.DataFrame(index =Temp['Data Type'].unique(), columns = ['Features','Count'])
for c in Temp['Data Type'].unique():
Out.loc[Out.index == c, 'Features'] = [Temp.loc[Temp['Data Type'] == c].index.tolist()]
Out.loc[Out.index == c, 'Count'] = len(Temp.loc[Temp['Data Type'] == c].index.tolist())
Out.index.name = 'Data Type'
Out = Out.reset_index(drop = False)
Out['Data Type'] = Out['Data Type'].astype(str)
return Out
Data_Plot(Data)
dType = dtypes_group(Data)
First, let's convert all Yes/No columns as follows:
$$\begin{cases} -1 & \mbox{Unknown}\\ 0 & \mbox{No}\\ 1 & \mbox{Yes}\end{cases}$$
df = Data.copy()
Categorical_Variables = dType.loc[dType['Data Type'] == 'object'].values[0,1]
YN_Feat = []
for c in Categorical_Variables:
s = set(df[c].unique().tolist())
if s.issubset({'No', 'Yes', 'Unknown'}):
YN_Feat.append(c)
del c, s
List_Print('Yes/No Features', YN_Feat)
# Converting:
Temp = {'Yes':1, 'No':0, 'Unknown':-1}
for c in YN_Feat:
df[c] = df[c].replace(Temp).astype(int)
del c
display(df[YN_Feat].head().style.hide_index())
## Adding these keys and values to a dictionary
CatVar_dict = {}
for c in YN_Feat:
CatVar_dict[c] = Temp
# subtracting Yes/No features from Categorical_Variables
Categorical_Variables = list(set(Categorical_Variables) - set(YN_Feat))
del YN_Feat, Temp
Yes/No Features: Housing, Loan, Default, Term Deposit Subscription
| Housing | Loan | Default | Term Deposit Subscription |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | -1 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
Moreover,
List_Print('Remaining categorical features', Categorical_Variables)
Remaining categorical features: Month, Contact, Poutcome, Marital, Day Of Week, Job, Education
For these features, we have,
$$\mbox{Poutcome} = \begin{cases} -1 & \mbox{Nonexistent}\\ 0 & \mbox{Failure}\\ 1 & \mbox{Success}\end{cases}$$
Temp = {'Success':1, 'Failure':0, 'Nonexistent':-1}
df['Poutcome'] = df['Poutcome'].replace(Temp).astype(int)
CatVar_dict['Poutcome'] = Temp
del Temp
Temp = {'Divorced':2, 'Married':1, 'Single':0, 'Unknown':-1}
df['Marital'] = df['Marital'].replace(Temp).astype(int)
CatVar_dict['Marital'] = Temp
Temp = [x for x in calendar.day_name]
Temp0 = {}
for x in np.arange(len(Temp)):
Temp0[Temp[x]] = x
del Temp
df['Day Of Week'] = df['Day Of Week'].replace(Temp0).astype(int)
CatVar_dict['Day Of Week'] = Temp0
Temp = {'Telephone':0, 'Cellular':1}
df['Contact'] = df['Contact'].replace(Temp).astype(int)
CatVar_dict['Contact'] = Temp
Temp = {'Unknown':-1, 'Unemployed':0, 'Student': 1, 'Housemaid':2,
'Retired':3, 'Blue-Collar':4, 'Self-Employed': 5, 'Services':6,
'Technician':7, 'Admin.':8, 'Management':9, 'Entrepreneur':10 }
df['Job'] = df['Job'].replace(Temp).astype(int)
CatVar_dict['Job'] = Temp
Temp = [x for x in calendar.month_name]
Temp = Temp[1:]
Temp0 = {}
for x in np.arange(len(Temp)):
Temp0[Temp[x]] = x
del Temp
df['Month'] = df['Month'].replace(Temp0).astype(int)
CatVar_dict['Month'] = Temp0
Temp = {'Unknown':-1, 'Illiterate':0, 'Basic.4Y':1, 'Basic.6Y':2, 'Basic.9Y':3, 'High.School':4,
'Professional.Course':5, 'University.Degree':6}
df['Education'] = df['Education'].replace(Temp).astype(int)
CatVar_dict['Education'] = Temp
Categorical_Variables = CatVar_dict
del CatVar_dict
df.loc[df['Pdays'] == 999, 'Pdays'] = -1
Creating new features:
We can create age categories following the groupings published by Statistics Canada (statcan.gc.ca).
| Interval | Age Category | Age Category Code |
|---|---|---|
| 00-14 years | Children | 0 |
| 15-24 years | Youth | 1 |
| 25-64 years | Adults | 2 |
| 65 years and over | Seniors | 3 |
bins = pd.IntervalIndex.from_tuples([(14, 24), (24, 64), (64, 100)])
Temp = pd.cut(df['Age'], bins)
# The youngest client is over 14, so only Youth, Adults and Seniors occur (re-coded 0, 1, 2)
df['Age'] = Temp.astype(str).replace({'(14, 24]':0, '(24, 64]':1, '(64, 100]':2})
Therefore,
display(df.head())
| Features | Age | Job | Marital | Education | Default | Housing | Loan | Contact | Month | Day Of Week | ... | Campaign | Pdays | Previous | Poutcome | Employment Variation Rate | Consumer Price Index | Consumer Confidence Index | Euribor three Month Rate | Number of Employees | Term Deposit Subscription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 1 | -1 | 0 | -1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 1 | 1 | 6 | 1 | 4 | -1 | 0 | 0 | 0 | 4 | 0 | ... | 1 | -1 | 0 | -1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 2 | 1 | 6 | 1 | 4 | 0 | 1 | 0 | 0 | 4 | 0 | ... | 1 | -1 | 0 | -1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 3 | 1 | 8 | 1 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | ... | 1 | -1 | 0 | -1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 4 | 1 | 6 | 1 | 4 | 0 | 0 | 1 | 0 | 4 | 0 | ... | 1 | -1 | 0 | -1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
5 rows × 21 columns
Target = 'Term Deposit Subscription'
Labels = ['No', 'Yes']
Feat = 'Duration'
X = df[Feat].values.reshape(-1,1)
Test = np.arange(df[Feat].min(), df[Feat].max()).reshape(-1,1)
y = df[Target].values.reshape(-1,1)
logr = LogisticRegression(solver='newton-cg')
_ = logr.fit(X, y)
Pred_Prop = logr.predict_proba(Test)
fig, ax = plt.subplots(1, 1, figsize=(16,5))
# Data scatter and fitted probability curve
_ = ax.scatter(X, y, color='HotPink', edgecolor = 'DeepPink')
_ = ax.plot(Test, Pred_Prop[:,1], color='MidnightBlue', lw = 1)
Temp = ax.get_xlim()
_ = ax.hlines(0, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.hlines(1, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.set_xlim(Temp)
_ = ax.set_xlabel('Last Contact Duration (in seconds)')
_ = ax.set_ylabel('Probability of %s' % Target)
_ = ax.set_title('Estimated Probability of %s using Logistic Regression' % Target )
del X, Test, y, logr, Pred_Prop, Temp
Feat = 'Pdays'
X = df[Feat].values.reshape(-1,1)
Test = np.arange(df[Feat].min(), df[Feat].max()).reshape(-1,1)
y = df[Target].values.reshape(-1,1)
logr = LogisticRegression(solver='newton-cg')
_ = logr.fit(X, y)
Pred_Prop = logr.predict_proba(Test)
fig, ax = plt.subplots(1, 1, figsize=(16,5))
# Data scatter and fitted probability curve
_ = ax.scatter(X, y, color='Lime', edgecolor = 'LimeGreen', lw = 1)
_ = ax.plot(Test, Pred_Prop[:,1], color='MidnightBlue', lw = 1)
Temp = ax.get_xlim()
_ = ax.hlines(0, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.hlines(1, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.set_xlim(Temp)
_ = ax.set_xlabel('Days Passed Since the Previous Campaign Contact')
_ = ax.set_ylabel('Probability of %s' % Target)
_ = ax.set_title('Estimated Probability of %s using Logistic Regression' % Target )
del X, Test, y, logr, Pred_Prop, Temp
Let's take a look at the variance of the features.
Fig, ax = plt.subplots(figsize=(17,16))
Temp = df.drop(columns = [Target]).var().sort_values(ascending = False).to_frame(name= 'Variance').round(2).T
_ = sns.heatmap(Temp, ax=ax, annot=True, square=True, cmap =sns.color_palette("OrRd", 20),
linewidths = 0.8, vmin=0, vmax=Temp.max(axis =1)[0],
cbar_kws={'label': 'Feature Variance', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')
Furthermore, we would like to standardize the features by removing the mean and scaling to unit variance, using StandardScaler().
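Concretely, StandardScaler replaces each feature value $x$ by its z-score,

$$z = \frac{x - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the per-feature mean and standard deviation estimated during fit(). This leaves every non-constant feature with zero mean and unit variance.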
# Scaling
Temp = df.drop(columns = Target).columns.tolist()
scaler = StandardScaler()
_ = scaler.fit(df[Temp])
df[Temp] = scaler.transform(df[Temp])
# Variance Plot
Fig, ax = plt.subplots(figsize=(17,16))
Temp = df.drop(columns = [Target]).var().sort_values(ascending = False).to_frame(name= 'Variance').round(2).T
_ = sns.heatmap(Temp, ax=ax, annot=True, square=True, cmap =sns.color_palette('Greens'),
linewidths = 0.8, vmin=0, vmax=Temp.max(axis =1)[0],
cbar_kws={'label': 'Feature Variance', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')
def Correlation_Plot (Df,Fig_Size):
Correlation_Matrix = Df.corr().round(2)
mask = np.zeros_like(Correlation_Matrix)
mask[np.triu_indices_from(mask)] = True
for i in range(len(mask)):
mask[i,i]=0
Fig, ax = plt.subplots(figsize=(Fig_Size,Fig_Size))
sns.heatmap(Correlation_Matrix, ax=ax, mask=mask, annot=True, square=True,
cmap =sns.color_palette("Greens", n_colors=10), linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": .6})
Correlation_Plot (df, Fig_Size = 14)
Fig, ax = plt.subplots(figsize=(17,16))
Temp = df.corr().round(2)
Temp = Temp.loc[(Temp.index == Target)].drop(columns = Target).T.sort_values(by = Target).T
_ = sns.heatmap(Temp, ax=ax, annot=True, square=True, cmap =sns.color_palette("Greens", n_colors=10),
linewidths = 0.8, vmin=0, vmax=1,
annot_kws={"size": 12},
cbar_kws={'label': Target + ' Correlation', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')
df.to_csv (r'Data\Bank_mod_STD.csv', index = None, header=True)
First, consider the data distribution for Term Deposit Subscription.
def Header(Text, L = 100, C1 = Back.BLUE, C2 = Fore.BLUE):
print(C1 + Fore.WHITE + Style.NORMAL + Text + Style.RESET_ALL + ' ' + C2 +
Style.NORMAL + (L- len(Text) - 1)*'=' + Style.RESET_ALL)
def Line(L=100, C = Fore.BLUE): print(C + Style.NORMAL + L*'=' + Style.RESET_ALL)
def Search_List(Key, List): return [s for s in List if Key in s]
def Table1(Inp, Feat = Target):
Out = Inp[Feat].value_counts().to_frame('Number of Instances').reset_index()
Out = Out.rename(columns = {'index': Feat})
Out['Percentage'] = np.round(100* Out['Number of Instances'].values /Out['Number of Instances'].sum(), 2)
return Out
def Plot1(data, x= 'Number of Instances', Feat = Target, CL = ['FireBrick', 'SeaGreen']):
fig = px.bar(data, x= x, y= ['',''], orientation='h', color = Feat, text = 'Percentage',
color_discrete_sequence = CL, height = 180)
fig.update_layout(plot_bgcolor= 'white', legend_orientation='h', legend=dict(x=0, y=1.7),
xaxis = dict(tickmode = 'array', tickvals = [0, Data.shape[0]], ticktext = ['','']))
fig.update_traces(marker_line_color= 'Black', marker_line_width=1.5, opacity=1)
fig.update_traces(texttemplate='%{text:.2}% ', textposition='inside')
fig.update_xaxes(title_text=None, range=[0, Data.shape[0]])
fig.update_yaxes(title_text=None)
fig.show()
Plot1(data = Table1(Data, Feat = Target))
StratifiedShuffleSplit returns stratified randomized splits: each split contains approximately the same percentage of samples of each target class as the complete set.
X = df.drop(columns = Target).values
y = df[Target].astype(float).values
Test_Size = 0.3
sss = StratifiedShuffleSplit(n_splits=1, test_size=Test_Size, random_state=42)
_ = sss.get_n_splits(X, y)
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
del sss
Header('Train Set')
Temp = Table1(Data, Feat = Target)
_, Temp['Number of Instances'] = np.unique(y_train, return_counts=True)
Temp['Percentage'] = np.round(100* Temp['Number of Instances'].values /Temp['Number of Instances'].sum(), 2)
display(Temp.style.format({'Percentage': "{:.2f}"}).hide_index())
Header('Test Set')
Temp = Table1(Data, Feat = Target)
_, Temp['Number of Instances'] = np.unique(y_test, return_counts=True)
Temp['Percentage'] = np.round(100* Temp['Number of Instances'].values /Temp['Number of Instances'].sum(), 2)
display(Temp.style.format({'Percentage': "{:.2f}"}).hide_index())
del Temp
Line()
Train Set ==========================================================================================
| Term Deposit Subscription | Number of Instances | Percentage |
|---|---|---|
| No | 25583 | 88.73 |
| Yes | 3248 | 11.27 |
Test Set ===========================================================================================
| Term Deposit Subscription | Number of Instances | Percentage |
|---|---|---|
| No | 10965 | 88.74 |
| Yes | 1392 | 11.26 |
====================================================================================================
Therefore, we have divided the dataset into train and test set using stratification that preserves the distribution of classes in train and test sets.
A multi-layer perceptron (MLP) is a class of feedforward artificial neural network (ANN). At each iteration, the algorithm measures the loss with the cross-entropy criterion, computes the gradients via backpropagation, and updates the model parameters, so that over the course of training the predictions increasingly agree with the true labels.
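For a single sample over $C$ classes, the cross-entropy loss used below is

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{p}_c,$$

where $y_c$ is 1 for the true class and 0 otherwise, and $\hat{p}_c$ is the predicted class probability; PyTorch's nn.CrossEntropyLoss applies the softmax internally, so the network outputs raw logits.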
if torch.cuda.is_available():
X_train_tensor = Variable(torch.from_numpy(X_train).cuda())
y_train_tensor = Variable(torch.from_numpy(y_train).type(torch.LongTensor).cuda())
X_test_tensor = Variable(torch.from_numpy(X_test).cuda())
y_test_tensor = Variable(torch.from_numpy(y_test).type(torch.LongTensor).cuda())
else:
X_train_tensor = Variable(torch.from_numpy(X_train))
y_train_tensor = Variable(torch.from_numpy(y_train).type(torch.LongTensor))
X_test_tensor = Variable(torch.from_numpy(X_test))
y_test_tensor = Variable(torch.from_numpy(y_test).type(torch.LongTensor))
Batch_size = 100
iteration_number = int(2e4)
epochs_number = int(iteration_number / (len(X_train) / Batch_size))
# Pytorch train and test sets
Train_set = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
Test_set = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)
# data loader
train_loader = torch.utils.data.DataLoader(Train_set, batch_size = Batch_size, shuffle = False)
test_loader = torch.utils.data.DataLoader(Test_set, batch_size = Batch_size, shuffle = False)
# Create MLP_Model
class MLP_Model(nn.Module):
def __init__(self, input_Size, hidden_Size, output_Size):
super(MLP_Model, self).__init__()
# Linear function 1:
self.fc1 = nn.Linear(input_Size, hidden_Size)
# Non-linearity 1
self.relu1 = nn.ReLU()
# Linear function 2:
self.fc2 = nn.Linear(hidden_Size, output_Size)
def forward(self, x):
# Linear function 1
out = self.fc1(x)
# Non-linearity 1
out = self.relu1(out)
# Linear function 2 (readout)
out = self.fc2(out)
return out
input_Size, output_Size = len(X[0]), len(np.unique(y))
hidden_Size = 128
# model
model = MLP_Model(input_Size, hidden_Size, output_Size)
# GPU
if torch.cuda.is_available():
model.cuda()
# Cross Entropy Loss
CEL= nn.CrossEntropyLoss()
# Optimizer
learning_rate = 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Traning the Model
Count = 0
Loss_list = []
Iteration_list = []
Accuracy_list = []
MSE_list = []
MAE_list = []
Steps = 10
Progress_Bar = progressbar.ProgressBar(maxval= iteration_number + 200,
widgets=[progressbar.Bar('=', '|', '|'),
progressbar.Percentage()])
# print('---------------------------------------------------------')
for epoch in range(epochs_number):
for i, (Xtr, ytr) in enumerate(train_loader):
# Variables
Xtr = Variable(Xtr.view(-1, X[0].shape[0]))
ytr = Variable(ytr)
# Set all gradients to zero
optimizer.zero_grad()
# Forward
Out = model(Xtr.float())
# loss
loss = CEL(Out, ytr.long())
# Backward (Calculating the gradients)
loss.backward()
# Update parameters
optimizer.step()
Count += 1
del Xtr, ytr
# Predictions
if Count % Steps == 0:
# Calculate Accuracy
Correct, Total = 0, 0
# Predictions
for Xts, yts in test_loader:
Xts = Variable(Xts.view(-1, X[0].shape[0]))
# Forward
Out = model(Xts.float())
# The maximum value of Out
Predicted = torch.max(Out.data, 1)[1]
# Total number of yts
Total += len(yts)
# Total Correct predictions
Correct += (Predicted == yts).sum()
del Xts, yts
# storing loss and iteration
Loss_list.append(loss.data)
Iteration_list.append(Count)
Accuracy_list.append(Correct / float(Total))
Progress_Bar.update(Count)
Progress_Bar.finish()
history = pd.DataFrame({'Iteration': np.array(Iteration_list),
'Loss': np.array([x.cpu().data.numpy() for x in Loss_list]),
'Accuracy': np.array([x.cpu().data.numpy() for x in Accuracy_list])})
del Loss_list, Iteration_list, Accuracy_list
|=========================================================================|100%
def Plot_history(history, Table_Rows = 25, yLim = 2):
fig = make_subplots(rows=1, cols=2, horizontal_spacing = 0.02, column_widths=[0.6, 0.4],
specs=[[{"type": "scatter"},{"type": "table"}]])
# Left
fig.add_trace(go.Scatter(x= history['Iteration'].values, y= history['Loss'].astype(float).values.round(4),
line=dict(color='OrangeRed', width= 1.5), name = 'Loss'), 1, 1)
fig.add_trace(go.Scatter(x= history['Iteration'].values, y= history['Accuracy'].astype(float).values,
line=dict(color='MidnightBlue', width= 1.5), name = 'Accuracy'), 1, 1)
fig.update_layout(legend=dict(x=0, y=1.1, traceorder='reversed', font_size=12),
dragmode='select', plot_bgcolor= 'white', height=600, hovermode='closest',
legend_orientation='h')
fig.update_xaxes(range=[history.Iteration.min(), history.Iteration.max()],
showgrid=True, gridwidth=1, gridcolor='Lightgray',
showline=True, linewidth=1, linecolor='Lightgray', mirror=True, row=1, col=1)
fig.update_yaxes(range=[0, yLim], showgrid=True, gridwidth=1, gridcolor='Lightgray',
showline=True, linewidth=1, linecolor='Lightgray', mirror=True, row=1, col=1)
# Right
ind = np.linspace(0, history.shape[0], Table_Rows, endpoint = False).round(0).astype(int)
ind = np.append(ind, history.Iteration.values[-1])
history = history[history.index.isin(ind)]
fig.add_trace(go.Table(header=dict(values = list(history.columns), line_color='darkslategray',
fill_color='DimGray', align=['center','center'],
font=dict(color='white', size=12), height=25), columnwidth = [0.4, 0.4, 0.4, 0.4],
cells=dict(values=[history.Iteration, history.Loss.astype(float).round(4).values,
history.Accuracy.astype(float).round(4).values],
line_color='darkslategray', fill=dict(color=['WhiteSmoke', 'white']),
align=['center', 'center'], font_size=12,height=20)), 1, 2)
fig.show()
Plot_history(history, Table_Rows = 18, yLim = 1)
The confusion matrix allows for visualization of the performance of an algorithm.
def Confusion_Matrix(Model, FG = (12, 4), X_train_tensor = X_train_tensor, y_train = y_train,
X_test_tensor = X_test_tensor, y_test = y_test):
font = FontProperties()
font.set_weight('bold')
############# Train Set #############
fig, ax = plt.subplots(1, 2, figsize=FG)
_ = fig.suptitle('Train Set', fontproperties=font, fontsize = 16)
# Predictions
    y_pred = Model(X_train_tensor.float())
y_pred = torch.max(y_pred.data, 1)[1]
y_pred = y_pred.cpu().data.numpy()
# confusion matrix
CM = metrics.confusion_matrix(y_train, y_pred)
CM_Train = CM.copy()
_ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Blues", ax = ax[0])
_ = ax[0].set_title('Confusion Matrix')
CM = CM.astype('float') / CM.sum(axis=1)[:, np.newaxis]
_ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Greens", ax = ax[1],
linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": 1})
_ = ax[1].set_title('Normalized Confusion Matrix')
for a in ax:
_ = a.set_xlabel('Predicted labels')
_ = a.set_ylabel('True labels')
_ = a.xaxis.set_ticklabels(Labels)
_ = a.yaxis.set_ticklabels(Labels)
del CM
############# Test Set #############
fig, ax = plt.subplots(1, 2, figsize=FG)
_ = fig.suptitle('Test Set', fontproperties=font, fontsize = 16)
# Predictions
    y_pred = Model(X_test_tensor.float())
y_pred = torch.max(y_pred.data, 1)[1]
y_pred = y_pred.cpu().data.numpy()
# confusion matrix
CM = metrics.confusion_matrix(y_test, y_pred)
CM_Test = CM.copy()
_ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Blues", ax = ax[0])
_ = ax[0].set_title('Confusion Matrix')
CM = CM.astype('float') / CM.sum(axis=1)[:, np.newaxis]
_ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Greens", ax = ax[1],
linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": 1})
_ = ax[1].set_title('Normalized Confusion Matrix')
for a in ax:
_ = a.set_xlabel('Predicted labels')
_ = a.set_ylabel('True labels')
_ = a.xaxis.set_ticklabels(Labels)
_ = a.yaxis.set_ticklabels(Labels)
del CM
return CM_Train, CM_Test
CM_Train, CM_Test = Confusion_Matrix(model)
Note that:
\begin{align} \text{Accuracy} &= \frac{T_p + T_n}{T_p + T_n + F_p + F_n}, & \text{Precision} &= \frac{T_p}{T_p + F_p}, & \text{Recall} &= \frac{T_p}{T_p + F_n}, \end{align}
where $T_p$, $T_n$, $F_p$, and $F_n$ represent the true positive, true negative, false positive, and false negative counts, respectively.
However, accuracy can be a misleading metric for imbalanced data sets. Here, over 88 percent of the samples are negative (No) and only about 12 percent are positive (Yes). In such cases, the balanced accuracy (bACC) [4] is recommended: it normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two:
\begin{align} \text{TPR} &= \frac{T_p}{T_p + F_n},\\ \text{TNR} &= \frac{T_n}{T_n + F_p},\\ \text{Balanced Accuracy (bACC)} &= \frac{\text{TPR}+\text{TNR}}{2} = \frac{1}{2}\left(\frac{T_p}{T_p + F_n} + \frac{T_n}{T_n + F_p}\right) \end{align}
Another metric is the predicted positive condition rate (PPCR), which identifies the percentage of the total population that is flagged:
\begin{align} \text{PPCR} = \frac{T_p+F_p}{T_p+F_p+T_n+F_n} \end{align}
Header('Train Set')
tn, fp, fn, tp = CM_Train.ravel()
Precision = tp/(tp+fp)
Recall = tp/(tp + fn)
TPR = tp/(tp +fn)
TNR = tn/(tn +fp)
BA = (TPR + TNR)/2
PPCR = (tp + fp)/(tp + fp + tn+ fn)
print('Precision (Train) = %.2f' % Precision)
print('Recall (Train) = %.2f' % Recall)
print('TPR (Train) = %.2f' % TPR)
print('TNR (Train) = %.2f' % TNR)
print('Balanced Accuracy (Train) = %.2f' % BA)
print('Predicted Positive Condition Rate (Train) = %.2f' % PPCR)
Header('Test Set')
tn, fp, fn, tp = CM_Test.ravel()
Precision = tp/(tp+fp)
Recall = tp/(tp + fn)
TPR = tp/(tp +fn)
TNR = tn/(tn +fp)
BA = (TPR + TNR)/2
PPCR = (tp + fp)/(tp + fp + tn+ fn)
print('Precision (Test) = %.2f' % Precision)
print('Recall (Test) = %.2f' % Recall)
print('TPR (Test) = %.2f' % TPR)
print('TNR (Test) = %.2f' % TNR)
print('Balanced Accuracy (Test) = %.2f' % BA)
print('Predicted Positive Condition Rate (Test) = %.2f' % PPCR)
del tn, fp, fn, tp, Precision, Recall, TPR, TNR, BA, PPCR
Line()
Train Set ==========================================================================================
Precision (Train) = 0.65
Recall (Train) = 0.50
TPR (Train) = 0.50
TNR (Train) = 0.97
Balanced Accuracy (Train) = 0.73
Predicted Positive Condition Rate (Train) = 0.09
Test Set ===========================================================================================
Precision (Test) = 0.64
Recall (Test) = 0.48
TPR (Test) = 0.48
TNR (Test) = 0.97
Balanced Accuracy (Test) = 0.72
Predicted Positive Condition Rate (Test) = 0.08
====================================================================================================
Sample = df.sample(frac = 0.1)
# df was already standardized above, so no further scaling is needed
X_sample = Sample.drop(columns = [Target]).values
if torch.cuda.is_available():
X_sample_tensor = Variable(torch.from_numpy(X_sample).cuda())
else:
X_sample_tensor = Variable(torch.from_numpy(X_sample))
Labels = ['No', 'Yes']
y_pred = model(X_sample_tensor.float())
y_pred = np.asarray(y_pred.cpu().detach().numpy())
y_pred = pd.Series(y_pred.argmax(axis=1), index = Sample.index).to_frame('Term Deposit Subscription (Predicted)').applymap(lambda x: Labels[0] if x == 0 else Labels[1])
Predictions = pd.concat([Data, y_pred], axis = 1).dropna(subset = ['Term Deposit Subscription (Predicted)'])
display(Predictions)
| | Age | Job | Marital | Education | Default | Housing | Loan | Contact | Month | Day Of Week | ... | Pdays | Previous | Poutcome | Employment Variation Rate | Consumer Price Index | Consumer Confidence Index | Euribor three Month Rate | Number of Employees | Term Deposit Subscription | Term Deposit Subscription (Predicted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Housemaid | Married | Basic.4Y | No | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | No | No |
| 1 | 57 | Services | Married | High.School | Unknown | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | No | No |
| 2 | 37 | Services | Married | High.School | No | Yes | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | No | No |
| 3 | 40 | Admin. | Married | Basic.6Y | No | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | No | No |
| 4 | 56 | Services | Married | High.School | No | No | Yes | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4114 | 52 | Entrepreneur | Married | University.Degree | No | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | No | No |
| 4115 | 55 | Services | Divorced | High.School | No | No | Yes | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | No | No |
| 4116 | 24 | Services | Single | High.School | No | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | No | No |
| 4117 | 46 | Admin. | Divorced | High.School | Unknown | No | Yes | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | No | No |
| 4118 | 31 | Admin. | Divorced | University.Degree | No | No | No | Telephone | May | Monday | ... | 999 | 0 | Nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | No | No |
4119 rows × 22 columns
Although the model performs reasonably well given the complexity of this problem, the results could be improved by designing an iterative optimization that also takes the accuracy and recall scores into account.
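One simple lever in that direction (a sketch of a standard technique, not the method used above) is to weight the cross-entropy loss by inverse class frequency, so that misclassified 'Yes' samples cost more and recall tends to rise:

```python
import torch
import torch.nn as nn

# Class counts mirroring the ~89/11 No/Yes split of the training set.
class_counts = torch.tensor([25583.0, 3248.0])

# Inverse-frequency weights: the rare 'Yes' class gets the larger weight.
weights = class_counts.sum() / (2.0 * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

# Toy logits and targets, just to show the call signature.
logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
print(loss.item())
```

The same `criterion` can be dropped into the training loop above in place of the unweighted `CEL`.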
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]
J. P. Mower. PREP-Mt: Predictive RNA Editor for Plant Mitochondrial Genes. BMC Bioinformatics, 6(1), 2005.